Transaction processing by the consumer
When configuring the Qlik Replicate MapR Streams endpoint, users can specify various settings that determine where messages are published within the MapR Streams infrastructure (topics/partitions).
During a task's CDC stage, committed changes that are detected by the Qlik Replicate source endpoint are grouped by transaction, sorted internally in chronological order, and then propagated to the target endpoint. The target endpoint can handle the changes in various ways such as applying them to the target tables or storing them in dedicated Change Tables.
Each CDC message has both a transaction ID and a change sequence. As the change sequence is a monotonically increasing number, sorting events by change sequence always achieves chronological order. Grouping the sorted events by transaction ID then results in transactions containing chronologically sorted changes.
However, as MapR Streams is a messaging infrastructure, applying the changes to it is not feasible and storing them in Change Tables is not applicable. The Replicate MapR Streams endpoint therefore takes a different approach: it reports all transactional events as messages.
How it works
Each change in the source system is translated to a data message containing the details of the change, including the transaction ID and change sequence in the source. The data message also includes the values of the changed columns before and after the change. As explained above, the order in which the MapR Streams target writes the messages is the same as the order of the changes within each transaction.
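For illustration only, a minimal sketch of what such a data message might look like once deserialized is shown below. The field names and values are hypothetical and do not represent the actual Replicate message schema:

```python
# A hypothetical, simplified data message for an UPDATE on a "customers" table.
# Field names are illustrative only; the actual Replicate message schema may differ.
data_message = {
    "operation": "UPDATE",
    "table": "customers",
    "transaction_id": "tx-42",                       # groups changes belonging to one transaction
    "change_sequence": "00000000000000000217",       # monotonically increasing across changes
    "before": {"customer_id": 17, "city": "Oslo"},   # column values before the change
    "after": {"customer_id": 17, "city": "Bergen"},  # column values after the change
}
```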
Once a data message is ready to be sent to MapR Streams, the topic and partition it should go to are determined by analyzing the endpoint settings as well as any transformation settings. For example, the user might configure the endpoint so that every table is sent to a different topic, and set the partition strategy to "Random", meaning that each message (within the same table) is sent to a random partition.
Transaction consistency from a consumer perspective
If maintaining transaction consistency is important for the consumer implementation, then although the transaction ID is present in all data messages, the challenge is to gather the messages in a way that makes it possible to identify a whole transaction. An additional challenge is processing the transactions in the original order in which they were committed, which becomes even harder when transactions are spread across multiple topics and partitions.
The simplest way of achieving the above goal is to direct Replicate to a specific topic and a specific partition (in the endpoint settings). This means that all data messages will end up in a single partition, thus guaranteeing ordered delivery both of transactions and of changes within a transaction. The consuming application can then read the messages, accumulate each transaction in an intermediate memory buffer, and mark the previous transaction as completed whenever a new transaction ID is detected.
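The following is a minimal Python sketch of that approach. It assumes the data messages arrive strictly in order from the single topic/partition and have already been deserialized into dictionaries with a transaction_id field (as in the hypothetical message above); read_messages_in_order is a hypothetical placeholder for the actual consumer loop:

```python
from typing import Dict, Iterable, Iterator, List

def group_into_transactions(messages: Iterable[Dict]) -> Iterator[List[Dict]]:
    """Yield complete transactions from an ordered stream of data messages.

    Assumes all messages come from a single topic/partition, so both the
    transactions and the changes within each transaction arrive in order.
    """
    current_tx_id = None
    buffer: List[Dict] = []

    for message in messages:
        tx_id = message["transaction_id"]
        if current_tx_id is not None and tx_id != current_tx_id:
            # A new transaction ID marks the previous transaction as completed.
            yield buffer
            buffer = []
        current_tx_id = tx_id
        buffer.append(message)

    if buffer:
        # Flush the last accumulated transaction when the stream ends.
        yield buffer

# Usage (read_messages_in_order is a hypothetical stand-in for the consumer
# loop that reads from the single partition and deserializes each message):
# for transaction in group_into_transactions(read_messages_in_order()):
#     process(transaction)
```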
Although this simple approach may work, it is not very efficient at the task level because all messages end up in the same topic and partition, which does not necessarily utilize the full parallelism of the MapR Streams cluster. This may be a non-issue if there are multiple tasks, each using a different topic/partition. In such a scenario, the gathering of messages from those tasks may well utilize the cluster optimally.
In the more generic case, where data may be spread over multiple topics and partitions, some intermediate buffer, such as memory, a table in a relational database, or even other topics, would need to be used to collect information about transactions. The transactions would then be rebuilt periodically (every few minutes/hours) by sorting the events collected from Replicate's MapR Streams output by change sequence and grouping them by transaction ID.
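The following Python sketch illustrates that pattern, using an in-memory list as the intermediate buffer (a database table or another topic could equally be used) and the same hypothetical message fields as above. Detecting whether a rebuilt transaction is actually complete before processing it would require additional logic not shown here:

```python
import time
from collections import defaultdict
from typing import Dict, List

class TransactionRebuilder:
    """Collect data messages read from multiple topics/partitions and
    periodically rebuild transactions by sorting on change sequence and
    grouping by transaction ID. An in-memory list stands in for the
    intermediate buffer in this sketch."""

    def __init__(self, flush_interval_seconds: int = 300):
        self.buffer: List[Dict] = []
        self.flush_interval = flush_interval_seconds
        self.last_flush = time.monotonic()

    def collect(self, message: Dict) -> None:
        # Called from the consumer loop for every data message, regardless of
        # which topic or partition it was read from.
        self.buffer.append(message)

    def maybe_rebuild(self) -> Dict[str, List[Dict]]:
        # Every few minutes, rebuild the transactions collected so far.
        if time.monotonic() - self.last_flush < self.flush_interval:
            return {}
        # Sorting by change sequence restores chronological order; grouping by
        # transaction ID then yields transactions with ordered changes.
        self.buffer.sort(key=lambda m: m["change_sequence"])
        transactions: Dict[str, List[Dict]] = defaultdict(list)
        for message in self.buffer:
            transactions[message["transaction_id"]].append(message)
        self.buffer = []
        self.last_flush = time.monotonic()
        return transactions
```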